This is a three plex experiment using 10-plex TMT with no reference channels per plex. Each plex is a set of biological replicates of developing mouse lens: E15, E18, P0, P3, P6, P9 (six ages, 3 day intervals). Data were downloaded from PRIDE (PXD006381) and processed with the PAW Pipeline (Wilmarth 2009) and IRS normalized using the average of each plex as a mock reference channel (Plubell 2017). This is the publication:
| Age | Description | Biological Replicates | Pooled Mice Lenses |
|---|---|---|---|
| E15 | Embryonic day 15 | 3 | 23 embryos |
| E18 | Embryonic day 18 | 3 | 10 embryos |
| P0 | Newborn | 3 | 8 pups |
| P3 | 3-day old | 3 | 8 pups |
| P6 | 6-day old | 3 | 8 pups |
| P9 | 9-day old | 3 | 8 pups |
| Channel | Plex 1 | Plex 2 | Plex 3 |
|---|---|---|---|
| 126C | E15 | E15 | E15 |
| 127N | E18 | E18 | E18 |
| 127C | P0 | P0 | P0 |
| 128N | P3 | P3 | P3 |
| 128C | P6 | P6 | P6 |
| 129N | P9 | P9 | P9 |
| 129C | Mix? | Mix? | Mix? |
| 130N | unused | unused | unused |
| 130C | unused | unused | unused |
| 131N | unused | unused | unused |
There were three plexes with biological replicates of the 6 ages. There was no pooled internal reference standard. The plex average intensity for each protein was used as a mock reference intensity vector. IRS was used to put all three plexes (18 samples) on a common intensity scale.
Successive ages are pairwise compared here. Statistical analysis is done with edgeR (Robinson 2010A) exact test after TMM normalization (Robinson 2010B).
# library imports
library(tidyverse)
library(scales)
library(limma)
library(edgeR)
library(psych)
Warning message: “replacing previous import ‘lifecycle::last_warnings’ by ‘rlang::last_warnings’ when loading ‘tibble’” ── Attaching packages ─────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ── ✔ ggplot2 3.3.2 ✔ purrr 0.3.4 ✔ tibble 3.0.3 ✔ dplyr 1.0.2 ✔ tidyr 1.1.1 ✔ stringr 1.4.0 ✔ readr 1.3.1 ✔ forcats 0.5.0 ── Conflicts ────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ── ✖ dplyr::filter() masks stats::filter() ✖ dplyr::lag() masks stats::lag() Attaching package: ‘scales’ The following object is masked from ‘package:purrr’: discard The following object is masked from ‘package:readr’: col_factor Attaching package: ‘psych’ The following objects are masked from ‘package:scales’: alpha, rescale The following objects are masked from ‘package:ggplot2’: %+%, alpha
# ================== TMM normalization from DGEList object =====================
apply_tmm_factors <- function(y, color = NULL, plot = TRUE, table = TRUE, title = "TMM Normalized data") {
# computes the tmm normalized data from the DGEList object
# y - DGEList object
# color - color vector for boxplots
# plot - flag to show boxplot or not
# table - flag to print norm factor tables or not
# title - main title for boxplots
# returns a dataframe with normalized intensities
# compute and print "Sample loading" normalization factors
lib_facs <- mean(y$samples$lib.size) / y$samples$lib.size
if(table == TRUE) {
cat("\nLibrary size factors:\n",
sprintf("%-5s -> %f\n", colnames(y$counts), lib_facs))
}
# compute and print TMM normalization factors
tmm_facs <- 1/y$samples$norm.factors
if(table == TRUE) {
cat("\nTrimmed mean of M-values (TMM) factors:\n",
sprintf("%-5s -> %f\n", colnames(y$counts), tmm_facs))
}
# compute and print the final correction factors
norm_facs <- lib_facs * tmm_facs
if(table == TRUE) {
cat("\nCombined (lib size and TMM) normalization factors:\n",
sprintf("%-5s -> %f\n", colnames(y$counts), norm_facs))
}
# compute the normalized data as a new data frame
tmt_tmm <- as.data.frame(sweep(y$counts, 2, norm_facs, FUN = "*"))
colnames(tmt_tmm) <- str_c(colnames(y$counts), "_tmm")
# visualize results and return data frame
if(plot == TRUE) {
boxplot(log10(tmt_tmm), col = color, notch = TRUE, main = title)
}
tmt_tmm
}
# ============ runs an edgeR exact pairwise test ========================
pairwise_test <- function(delist, pair, p.value) {
# runs an edgeR exact test and returns results
et <- exactTest(delist, pair = pair)
tt <- topTags(et, n = Inf, sort.by = "none")
# print top tags
cat("\n")
print(topTags(et)$table, digits = 4)
cat("\n")
# count candidates
print(summary(decideTestsDGE(et, p.value = p.value)))
# see candidates on an MA plot
plotMD(et, p.value = p.value)
abline(h = c(-1, 1), col = "black")
# check the p-value distribution
pval <- ggplot(tt$table, aes(PValue)) +
geom_histogram(bins = 100, fill = "white", color = "black") +
geom_hline(yintercept = mean(hist(et$table$PValue, breaks = 100,
plot = FALSE)$counts[26:100])) +
ggtitle("p-value distribution")
print(pval) # this makes the plot show up
tt # return top-tags object
}
# ================= reformat edgeR test results ================================
collect_results <- function(df, tt, x, xlab, y, ylab) {
# Computes new columns and extracts some columns to make results frame
# df - data in data.frame
# tt - top tags table from edgeR test
# x - columns for first condition
# xlab - label for x
# y - columns for second condition
# ylab - label for y
# returns a new dataframe
# condition average vectors
ave_x <- rowMeans(df[x])
ave_y <- rowMeans(df[y])
# FC, direction, candidates
fc <- ifelse(ave_y > ave_x, (ave_y / ave_x), (-1 * ave_x / ave_y))
direction <- ifelse(ave_y > ave_x, "up", "down")
candidate <- cut(tt$FDR, breaks = c(-Inf, 0.01, 0.05, 0.10, 1.0),
labels = c("high", "med", "low", "no"))
# make data frame
temp <- cbind(df[c(x, y)], data.frame(logFC = tt$logFC, FC = fc,
PValue = tt$PValue, FDR = tt$FDR,
ave_x = ave_x, ave_y = ave_y,
direction = direction, candidate = candidate,
Acc = tt$genes))
# fix column headers for averages
names(temp)[names(temp) %in% c("ave_x", "ave_y")] <- str_c("ave_", c(xlab, ylab))
temp # return the data frame
}
# =============== log2_fold_changes ================================================
log2_fold_changes <- function(results, left, right, title) {
# Makes log2 fold change plots by candidate
# results - results data frame
# left - lower FC log2 limit
# right - upper FC log2 limit
# title - plot title
# see how many candidates by category
cat("\n")
print(results %>% count(candidate))
cat("\n")
# plot log2 fold-changes by category
fc <- ggplot(results, aes(x = logFC, fill = candidate)) +
geom_histogram(binwidth=0.1, color = "black") +
facet_wrap(~candidate) +
coord_cartesian(xlim = c(-left, right)) +
ggtitle(title)
print(fc)
}
# ========== Setup for MA and volcano plots ====================================
transform <- function(results, x, y) {
# Make data frame with some transformed columns
# results - results data frame
# x - columns for x condition
# y - columns for y condition
# return new data frame
df <- data.frame(log10((results[x] + results[y])/2),
log2(results[y] / results[x]),
results$candidate,
-log10(results$FDR))
colnames(df) <- c("A", "M", "candidate", "P")
df # return the data frame
}
# ========== MA plots using ggplot =============================================
MA_plots <- function(results, x, y, title) {
# makes MA-plot DE candidate ggplots
# results - data frame with edgeR results and some condition average columns
# x - string for x-axis column
# y - string for y-axis column
# title - title string to use in plots
# returns a list of plots
# uses transformed data
temp <- transform(results, x, y)
# 2-fold change lines
ma_lines <- list(geom_hline(yintercept = 0.0, color = "black"),
geom_hline(yintercept = 1.0, color = "black", linetype = "dotted"),
geom_hline(yintercept = -1.0, color = "black", linetype = "dotted"))
# make main MA plot
ma <- ggplot(temp, aes(x = A, y = M)) +
geom_point(aes(color = candidate, shape = candidate)) +
scale_y_continuous(paste0("logFC (", y, "/", x, ")")) +
scale_x_continuous("Ave_intensity") +
ggtitle(title) +
ma_lines
# make separate MA plots
ma_facet <- ggplot(temp, aes(x = A, y = M)) +
geom_point(aes(color = candidate, shape = candidate)) +
scale_y_continuous(paste0("log2 FC (", y, "/", x, ")")) +
scale_x_continuous("log10 Ave_intensity") +
ma_lines +
facet_wrap(~ candidate) +
ggtitle(str_c(title, " (separated)"))
# make the plots visible
print(ma)
print(ma_facet)
}
# ========== Scatter plots using ggplot ========================================
scatter_plots <- function(results, x, y, title) {
# makes scatter-plot DE candidate ggplots
# results - data frame with edgeR results and some condition average columns
# x - string for x-axis column
# y - string for y-axis column
# title - title string to use in plots
# returns a list of plots
# 2-fold change lines
scatter_lines <- list(geom_abline(intercept = 0.0, slope = 1.0, color = "black"),
geom_abline(intercept = 0.301, slope = 1.0, color = "black", linetype = "dotted"),
geom_abline(intercept = -0.301, slope = 1.0, color = "black", linetype = "dotted"),
scale_y_log10(),
scale_x_log10())
# make main scatter plot
scatter <- ggplot(results, aes_string(x, y)) +
geom_point(aes(color = candidate, shape = candidate)) +
ggtitle(title) +
scatter_lines
# make separate scatter plots
scatter_facet <- ggplot(results, aes_string(x, y)) +
geom_point(aes(color = candidate, shape = candidate)) +
scatter_lines +
facet_wrap(~ candidate) +
ggtitle(str_c(title, " (separated)"))
# make the plots visible
print(scatter)
print(scatter_facet)
}
# ========== Volcano plots using ggplot ========================================
volcano_plot <- function(results, x, y, title) {
# makes a volcano plot
# results - a data frame with edgeR results
# x - string for the x-axis column
# y - string for y-axis column
# title - plot title string
# uses transformed data
temp <- transform(results, x, y)
# build the plot
ggplot(temp, aes(x = M, y = P)) +
geom_point(aes(color = candidate, shape = candidate)) +
xlab("log2 FC") +
ylab("-log10 FDR") +
ggtitle(str_c(title, " Volcano Plot"))
}
# ============== individual protein expression plots ===========================
# function to extract the identifier part of the accesssion
get_identifier <- function(accession) {
identifier <- str_split(accession, "\\|", simplify = TRUE)
identifier[,3]
# identifier <- accession
}
set_plot_dimensions <- function(width_choice, height_choice) {
options(repr.plot.width=width_choice, repr.plot.height=height_choice)
}
plot_top_tags <- function(results, nleft, nright, top_tags, color = c("red", "blue")) {
# results should have data first, then test results (two condition summary table)
# nleft, nright are number of data points in each condition
# top_tags is number of up and number of down top DE candidates to plot
# get top ipregulated
up <- results %>%
filter(logFC >= 0) %>%
arrange(FDR)
up <- up[1:top_tags, ]
# get top down regulated
down <- results %>%
filter(logFC < 0) %>%
arrange(FDR)
down <- down[1:top_tags, ]
# pack them
proteins <- rbind(up, down)
color_bars = c(rep(color[1], nleft), rep(color[2], nright))
for (row_num in 1:nrow(proteins)) {
row <- proteins[row_num, ]
vec <- as.vector(unlist(row[1:(nleft + nright)]))
names(vec) <- colnames(row[1:(nleft + nright)])
title <- str_c(get_identifier(row$Acc), ", int: ", scientific(mean(vec), 2),
", FDR: ", scientific(row$FDR, digits = 3),
", FC: ", round(row$FC, digits = 1),
", ", row$candidate)
barplot(vec, col = color_bars, main = title,
cex.main = 1.0, cex.names = 0.7, cex.lab = 0.7)
}
}
# ============== CV function ===================================================
CV <- function(df) {
# Computes CVs of data frame rows
# df - data frame,
# returns vector of CVs (%)
ave <- rowMeans(df) # compute averages
sd <- apply(df, 1, sd) # compute standard deviations
cv <- 100 * sd / ave # compute CVs in percent (last thing gets returned)
}
We checked the data quality in an earlier notebook, and it looked okay.
# load the IRS-normalized data and check the table
data_import <- read_tsv("labeled_grouped_protein_summary_TMT_9_IRS_normalized.txt",
guess_max = 5854)
# "Filter" column flags contams and decoys
# "Missing" column flags proteins without reporter ion intensities (full sets missing)
# the table from pandas is sorted so the rows we want come first
data_all <- filter(data_import, is.na(Filter), is.na(Missing))
data_sl <- select(data_all, contains("SLNorm_"))
data_irs <- select(data_all, contains("IRSNorm_"))
# save gene names for edgeR so we can double check that results line up
accessions <- data_all$Accession
# see how many rows of data we have
length(accessions)
# set some larger plot dimensions
set_plot_dimensions(9, 9)
Parsed with column specification: cols( .default = col_double(), Accession = col_character(), Identical = col_character(), Similar = col_character(), OtherLoci = col_character(), Filter = col_character(), Missing = col_character(), Description = col_character() ) See spec(...) for full column specifications.
# define the groups (after IRS)
E15_irs <- select(data_irs, contains("_E15_"))
E18_irs <- select(data_irs, contains("_E18_"))
P0_irs <- select(data_irs, contains("_P0_"))
P3_irs <- select(data_irs, contains("_P3_"))
P6_irs <- select(data_irs, contains("_P6_"))
P9_irs <- select(data_irs, contains("_P9_"))
# put groups together into a single data frame
tmt_data <- bind_cols(E15_irs, E18_irs, P0_irs, P3_irs, P6_irs, P9_irs)
# Set color vector:
colors = c(rep('dark red', 3), rep('red', 3), rep('dark blue', 3),
rep('blue', 3), rep('dark green', 3), rep('green', 3))
# make indexes for ages
E15 <- 1:3
E18 <- 4:6
P0 <- 7:9
P3 <- 10:12
P6 <- 13:15
P9 <- 16:18
We are defining the groups that will be compared explicitly and using all the samples for variance estimates. We put the data into a data frame, grouped by condition. We defined some column indexes for each condition, set some colors for plotting. We can run TMM normalization, check the final intensity distributions, and see how distinct the biological groups are in an MDS cluster view.
# get the biological sample data into a DGEList object
group <- c(rep('E15', 3), rep('E18', 3), rep('P0', 3),
rep('P3', 3), rep('P6', 3), rep('P9', 3))
y <- DGEList(counts = tmt_data, group = group, genes = accessions)
# run TMM normalization so we have the results in y
y <- calcNormFactors(y)
tmt_tmm <- apply_tmm_factors(y, table = TRUE, color = colors)
# check the clustering
plotMDS(y, col = colors, main = "Data after TMM")
Library size factors: IRSNorm_E15_1 -> 1.031476 IRSNorm_E15_2 -> 1.008094 IRSNorm_E15_3 -> 0.986400 IRSNorm_E18_1 -> 1.010526 IRSNorm_E18_2 -> 1.012694 IRSNorm_E18_3 -> 0.997802 IRSNorm_P0_1 -> 0.998726 IRSNorm_P0_2 -> 0.991580 IRSNorm_P0_3 -> 0.999595 IRSNorm_P3_1 -> 0.986234 IRSNorm_P3_2 -> 0.999611 IRSNorm_P3_3 -> 1.005907 IRSNorm_P6_1 -> 0.980859 IRSNorm_P6_2 -> 0.993474 IRSNorm_P6_3 -> 1.011724 IRSNorm_P9_1 -> 0.974078 IRSNorm_P9_2 -> 1.004781 IRSNorm_P9_3 -> 1.009580 Trimmed mean of M-values (TMM) factors: IRSNorm_E15_1 -> 0.712297 IRSNorm_E15_2 -> 0.660870 IRSNorm_E15_3 -> 0.666675 IRSNorm_E18_1 -> 0.839775 IRSNorm_E18_2 -> 0.704975 IRSNorm_E18_3 -> 0.849973 IRSNorm_P0_1 -> 0.908618 IRSNorm_P0_2 -> 0.896906 IRSNorm_P0_3 -> 0.880678 IRSNorm_P3_1 -> 1.077557 IRSNorm_P3_2 -> 1.213998 IRSNorm_P3_3 -> 1.069485 IRSNorm_P6_1 -> 1.224558 IRSNorm_P6_2 -> 1.314099 IRSNorm_P6_3 -> 1.317240 IRSNorm_P9_1 -> 1.463104 IRSNorm_P9_2 -> 1.354746 IRSNorm_P9_3 -> 1.501006 Combined (lib size and TMM) normalization factors: IRSNorm_E15_1 -> 0.734718 IRSNorm_E15_2 -> 0.666219 IRSNorm_E15_3 -> 0.657608 IRSNorm_E18_1 -> 0.848614 IRSNorm_E18_2 -> 0.713924 IRSNorm_E18_3 -> 0.848105 IRSNorm_P0_1 -> 0.907461 IRSNorm_P0_2 -> 0.889354 IRSNorm_P0_3 -> 0.880321 IRSNorm_P3_1 -> 1.062723 IRSNorm_P3_2 -> 1.213527 IRSNorm_P3_3 -> 1.075802 IRSNorm_P6_1 -> 1.201120 IRSNorm_P6_2 -> 1.305523 IRSNorm_P6_3 -> 1.332683 IRSNorm_P9_1 -> 1.425177 IRSNorm_P9_2 -> 1.361223 IRSNorm_P9_3 -> 1.515386
The TMM factors range from 0.7 to 1.5 suggesting some large compositional changes across the developmental ages. This testing is mostly a categorization exercise. We will compare the blue group (E15) to the red group (E18 and P0) and compare the red group (E18 and P0) to the dark green group (P3 and P6). The pairwise comparisons may capture some of the major developmental changes. Focusing on just two comparisons may make interpreting the results a little easier.
One of the more powerful features of edgeR is computing variances across larger numbers of genes (proteins) to get more robust variance estimates for small replicate experiments. Here, we have 18 samples across all ages to use to improve the variance estimates and reduce false positive differential expression (DE) candidates. We have an edgeR estimateDisp function that does all of this and a visualization function to check the result.
We need to estimate the dispersion parameters before we can do the actual statistical testing. This only needs to be done once. Each exact test will take specified conditions and compare them using the normalization factors and dispersion estimates saved in the DGEList object y.
# compute dispersions and plot BCV
y <- estimateDisp(y)
plotBCV(y, main = "BCV plot of IRS/TMM normalized data")
Design matrix not provided. Switch to the classic mode.
Compare E15 to E18 lenses.
We will specify the pair of interest for the exact test in edgeR and use the experiment-wide dispersion. The topTags table show us the most significant candidates (top 10). We can also summarize how many up and down regulated candidates we have at an FDR of 0.10. We save the test results in tt. We use a handy MA (mean-average) plotting function from limma to visualize the DE candidates and check the p-value distribution.
# use a 0.05 cutoff
tt = pairwise_test(y, c("E15", "E18"), 0.05)
genes logFC logCPM PValue FDR
410 sp|P07356|ANXA2_MOUSE 1.1933 8.413 2.243e-16 9.532e-13
402 sp|Q8BFX1|RN187_MOUSE 2.3372 8.140 1.644e-12 3.493e-09
1374 sp|A2AWP0|BIRC7_MOUSE 1.8859 5.903 2.784e-11 3.943e-08
389 sp|Q64448|CXA3_MOUSE 1.0867 8.097 4.317e-11 4.585e-08
1376 sp|P05132|KAPCA_MOUSE 1.2165 5.591 5.800e-11 4.928e-08
1906 sp|Q9DC51|GNAI3_MOUSE 0.8438 4.563 1.226e-10 8.680e-08
1188 sp|O35658|C1QBP_MOUSE -0.8176 5.942 2.783e-10 1.689e-07
848 sp|P62317|SMD2_MOUSE -0.8504 6.796 5.362e-10 2.848e-07
503 sp|P07091|S10A4_MOUSE 1.4201 8.023 1.801e-09 8.501e-07
1247 sp|Q9WVA3|BUB3_MOUSE -0.8315 5.475 2.188e-09 8.987e-07
E18-E15
Down 254
NotSig 3705
Up 290
There is a nice balance between up (red) and down (blue) candidates (about 250 each) with 3.7K unchanged proteins. Many candidates have abundance levels that are different from the unchanged proteins (black points in the MA plot). The black points are more diffuse than we see in other types of samples. TMT labeling tends to be rather precise and have most proteins with relatively small fold changes. Most proteins in the developing mouse lens seem to be changing to some degree.
The p-value distribution has a typical pattern of two distributions. There is a uniform (flat) distribution of p-values from 0 to 1 from chance associated with unchanged proteins. There is a sharp excess of small p-values that come from the (putative) DE candidates.
We will add the statistical testing results (logFC, p-values, and FDR), condition intensity averages, and candidate status to the results data frame (which has the TMM-normalized data) and accumulate all comparisons into all_results.
We will make MA plots, scatter plots, and a volcano plot using ggplot2. We will also look at the distribution of log2 expression ratios separated by differential expression (DE) category. The Benjamini-Hochberg corrected edgeR p-values are used to categorize the DE candidates: no > 0.10 > low > 0.05 > med > 0.01 > high.
# get the results summary
results <- collect_results(tmt_tmm, tt$table, E15, "E15", E18, "E18")
# make column names unique by adding comparison (for the accumulated frame)
results_temp <- results
colnames(results_temp) <- str_c(colnames(results), "_E15_E18")
# accumulate testing results
all_results <- cbind(accessions, results_temp)
# count candidates and check log2 fold change distributions
log2_fold_changes(results, 3, 3, "E15 vs E18 lenses logFC distributions by candidate")
candidate n 1 high 280 2 med 264 3 low 216 4 no 3489
We have many comparisons to visualize, so we will use functions to generate a series of plots. We will make: an MA plot with candidates highlighted by color, faceted MA plots separated by candidate status, a scatter plot with candidates highlighted by color, faceted scatter plots separated by candidate status, and a volcano plot with candidates highlighted by color. The solid black lines in the MA and scatter plots are the 1-to-1 lines; the dotted lines are 2-fold change lines.
# make MA plots
MA_plots(results, "ave_E15", "ave_E18", "E15 vs E18 lenses")
We see a nice pattern of increasing fold changes as we cut candidates by increasingly stricter FDR (Benjamini-Hochberg corrected p-values) thresholds. This is more easily seen in the facetted plots (the second panel). The 241 "high" candidates (FDR < 0.01) in orange (upper left) look well-separated from the unchanged proteins (purple points in lower right).
# make scatter plots
scatter_plots(results, "ave_E15", "ave_E18", "E15 vs E18 lenses")
# make a volcano plot
volcano_plot(results, "ave_E15", "ave_E18", "E15 vs E18 lenses")
We will look at the top 50 poteins by p-value in each direction since we have a moderate number of candidates.
# look at the top candidates
set_plot_dimensions(9, 3.5)
n_left <- 3
n_right <- 3
top_tags <- 50
color <- c("dark red", "red")
plot_top_tags(results, n_left, n_right, top_tags, color)
set_plot_dimensions(9, 9)
The age grouping that emerged from the cumulative abundance distribution curves (and was also apparent in the MDS clustering first dimension separations) results in convincing abundance differences. E15 is an early age and seems to differ from the later ages.
Compare E18 to P0 lenses.
We will specify the pair of interest for the exact test in edgeR and use the experiment-wide dispersion. The topTags table show us the most significant candidates (top 10). We can also summarize how many up and down regulated candidates we have at an FDR of 0.10. We save the test results in tt. We use a handy MA (mean-average) plotting function from limma to visualize the DE candidates and check the p-value distribution.
# use a 0.05 cutoff
tt = pairwise_test(y, c("E18", "P0"), 0.05)
genes logFC logCPM PValue FDR
2188 sp|Q923D2|BLVRB_MOUSE -1.3981 3.430 2.101e-07 0.0008927
2291 sp|P04919|B3AT_MOUSE -1.6904 3.856 1.663e-06 0.0024045
2033 sp|O55042|SYUA_MOUSE -1.7627 4.586 1.698e-06 0.0024045
54 sp|P04342|CRGD_MOUSE 1.0560 11.109 5.348e-06 0.0056814
11 sp|Q9CXV3|CRGF_MOUSE_family 0.5879 13.209 6.436e-05 0.0546891
113 sp|P62267|RS23_MOUSE -0.3910 9.485 9.642e-05 0.0682810
1009 sp|Q8JZK9|HMCS1_MOUSE -0.5883 6.320 1.886e-04 0.1078965
887 sp|P00920|CAH2_MOUSE -1.3940 6.225 2.056e-04 0.1078965
216 sp|P14115|RL27A_MOUSE -0.4510 9.176 2.285e-04 0.1078965
193 sp|O09167|RL21_MOUSE -0.4256 9.175 3.758e-04 0.1593103
P0-E18
Down 3
NotSig 4245
Up 1
E18 and P0 are very similar - there are few, if any, DE candidates. The unchanged proteins (black points) are also rather diffuse (although it is narrower than in the E15 -vs- E18 comparison). The p-value distribution is (mostly) flat.
We will add the statistical testing results (logFC, p-values, and FDR), condition intensity averages, and candidate status to the results data frame (which has the TMM-normalized data) and accumulate all comparisons into all_results.
We will make MA plots, scatter plots, and a volcano plot using ggplot2. We will also look at the distribution of log2 expression ratios separated by differential expression (DE) category. The Benjamini-Hochberg corrected edgeR p-values are used to categorize the DE candidates: no > 0.10 > low > 0.05 > med > 0.01 > high.
# get the results summary
results <- collect_results(tmt_tmm, tt$table, E18, "E18", P0, "P0")
# make column names unique by adding comparison (for the accumulated frame)
results_temp <- results
colnames(results_temp) <- str_c(colnames(results), "_E18_P0")
# accumulate testing results
all_results <- cbind(all_results, results_temp)
# count candidates and check log2 fold change distributions
log2_fold_changes(results, 3, 3, "E18 vs P0 lenses logFC distributions by candidate")
candidate n 1 high 4 2 low 2 3 no 4243
We have many comparisons to visualize, so we will use functions to generate a series of plots. We will make: an MA plot with candidates highlighted by color, faceted MA plots separated by candidate status, a scatter plot with candidates highlighted by color, faceted scatter plots separated by candidate status, and a volcano plot with candidates highlighted by color. The solid black lines in the MA and scatter plots are the 1-to-1 lines; the dotted lines are 2-fold change lines.
# make MA plots
MA_plots(results, "ave_E18", "ave_P0", "E18 vs P0 lenses")
# make scatter plots
scatter_plots(results, "ave_E18", "ave_P0", "E18 vs P0 lenses")
# make a volcano plot
volcano_plot(results, "ave_E18", "ave_P0", "E18 vs P0 lenses")
We will look at the top 10 by p-value in each direction. Some proteins will not be statistically significant.
# look at the top candidates
set_plot_dimensions(9, 3.5)
n_left <- 3
n_right <- 3
top_tags <- 10
color <- c("red", "dark blue")
plot_top_tags(results, n_left, n_right, top_tags, color)
set_plot_dimensions(9, 9)
Compare P0 to P3 lenses.
We will specify the pair of interest for the exact test in edgeR and use the experiment-wide dispersion. The topTags table show us the most significant candidates (top 10). We can also summarize how many up and down regulated candidates we have at an FDR of 0.10. We save the test results in tt. We use a handy MA (mean-average) plotting function from limma to visualize the DE candidates and check the p-value distribution.
# use a 0.05 cutoff
tt = pairwise_test(y, c("P0", "P3"), 0.05)
genes logFC logCPM PValue FDR
234 tr|E9Q616|E9Q616_MOUSE 1.0632 9.289 6.892e-19 2.928e-15
772 sp|Q8K0S2|ERIC5_MOUSE 0.7421 7.402 2.753e-07 5.850e-04
172 sp|Q62318|TIF1B_MOUSE -0.6615 9.319 5.171e-07 6.584e-04
801 sp|Q8C4U3|SFRP1_MOUSE 0.6647 6.889 6.199e-07 6.584e-04
366 sp|Q99MD9|NASP_MOUSE -0.8705 8.174 8.471e-07 7.199e-04
811 sp|Q3ULJ0|GPD1L_MOUSE 0.7959 6.807 1.432e-06 1.014e-03
751 sp|Q61581|IBP7_MOUSE 0.6940 7.165 1.783e-06 1.082e-03
1191 sp|Q9QYH6|MAGD1_MOUSE 0.7275 6.081 2.313e-06 1.228e-03
1057 sp|Q920Q8|NS1BP_MOUSE 0.6036 6.075 3.143e-06 1.484e-03
468 sp|P18242|CATD_MOUSE -0.6026 7.897 9.826e-06 3.944e-03
P3-P0
Down 43
NotSig 4165
Up 41
There are a modest number of DE candidates. The separation of the red and blue points from the black points is not very very good. There are many black points with similar fold changes that are not statistically significant and the unchanged proteins are diffuse (not very tight to the 0-fold-change line). The p-value distribution looks pretty good.
We will add the statistical testing results (logFC, p-values, and FDR), condition intensity averages, and candidate status to the results data frame (which has the TMM-normalized data) and accumulate all comparisons into all_results.
We will make MA plots, scatter plots, and a volcano plot using ggplot2. We will also look at the distribution of log2 expression ratios separated by differential expression (DE) category. The Benjamini-Hochberg corrected edgeR p-values are used to categorize the DE candidates: no > 0.10 > low > 0.05 > med > 0.01 > high.
# get the results summary
results <- collect_results(tmt_tmm, tt$table, P0, "P0", P3, "P3")
# make column names unique by adding comparison (for the accumulated frame)
results_temp <- results
colnames(results_temp) <- str_c(colnames(results), "_P0_P3")
# accumulate testing results
all_results <- cbind(all_results, results_temp)
# count candidates and check log2 fold change distributions
log2_fold_changes(results, 3, 3, "P0 vs P3 lenses logFC distributions by candidate")
candidate n 1 high 27 2 med 57 3 low 41 4 no 4124
We have many comparisons to visualize, so we will use functions to generate a series of plots. We will make: an MA plot with candidates highlighted by color, faceted MA plots separated by candidate status, a scatter plot with candidates highlighted by color, faceted scatter plots separated by candidate status, and a volcano plot with candidates highlighted by color. The solid black lines in the MA and scatter plots are the 1-to-1 lines; the dotted lines are 2-fold change lines.
# make MA plots
MA_plots(results, "ave_P0", "ave_P3", "P0 vs P3 lenses")
# make scatter plots
scatter_plots(results, "ave_P0", "ave_P3", "P0 vs P3 lenses")
# make a volcano plot
volcano_plot(results, "ave_P0", "ave_P3", "P0 vs P3 lenses")
We will look at the top 30 by p-value in each direction.
# look at the top candidates
set_plot_dimensions(9, 3.5)
n_left <- 3
n_right <- 3
top_tags <- 30
color <- c("dark blue", "blue")
plot_top_tags(results, n_left, n_right, top_tags, color)
set_plot_dimensions(9, 9)
Compare P3 to P6 lenses.
We will specify the pair of interest for the exact test in edgeR and use the experiment-wide dispersion. The topTags table show us the most significant candidates (top 10). We can also summarize how many up and down regulated candidates we have at an FDR of 0.10. We save the test results in tt. We use a handy MA (mean-average) plotting function from limma to visualize the DE candidates and check the p-value distribution.
# use a 0.05 cutoff
tt = pairwise_test(y, c("P3", "P6"), 0.05)
genes logFC logCPM PValue FDR
2018 sp|Q8VCM7|FIBG_MOUSE -1.0575 4.612 1.274e-06 0.005411
832 sp|P11276|FINC_MOUSE -0.7830 6.592 3.063e-05 0.042894
239 sp|O70250|PGAM2_MOUSE 0.8013 9.239 3.808e-05 0.042894
1386 sp|Q8K0E8|FIBB_MOUSE -1.1407 5.437 4.038e-05 0.042894
3366 sp|Q01149|CO1A2_MOUSE -1.8818 2.032 9.603e-05 0.059945
101 sp|P16460|ASSY_MOUSE 0.8036 10.492 9.885e-05 0.059945
259 sp|Q6ZWV3|RL10_MOUSE 0.4147 8.733 1.085e-04 0.059945
1444 sp|Q06890|CLUS_MOUSE 0.6037 5.683 1.129e-04 0.059945
1357 sp|E9PV24|FIBA_MOUSE -1.3047 5.360 1.978e-04 0.093363
451 sp|O70400|PDLI1_MOUSE 0.4705 8.258 2.592e-04 0.110128
P6-P3
Down 3
NotSig 4245
Up 1
There are very few DE candidates, the unchanged proteins are less diffuse, and the p-value distribution is flat.
We will add the statistical testing results (logFC, p-values, and FDR), condition intensity averages, and candidate status to the results data frame (which has the TMM-normalized data) and accumulate all comparisons into all_results.
We will make MA plots, scatter plots, and a volcano plot using ggplot2. We will also look at the distribution of log2 expression ratios separated by differential expression (DE) category. The Benjamini-Hochberg corrected edgeR p-values are used to categorize the DE candidates: no > 0.10 > low > 0.05 > med > 0.01 > high.
# get the results summary
results <- collect_results(tmt_tmm, tt$table, P3, "P3", P6, "P6")
# make column names unique by adding comparison (for the accumulated frame)
results_temp <- results
colnames(results_temp) <- str_c(colnames(results), "_P3_P6")
# accumulate testing results
all_results <- cbind(all_results, results_temp)
# count candidates and check log2 fold change distributions
log2_fold_changes(results, 3, 3, "P3 vs P6 lenses logFC distributions by candidate")
candidate n 1 high 1 2 med 3 3 low 5 4 no 4240
We have many comparisons to visualize, so we will use functions to generate a series of plots. We will make: an MA plot with candidates highlighted by color, faceted MA plots separated by candidate status, a scatter plot with candidates highlighted by color, faceted scatter plots separated by candidate status, and a volcano plot with candidates highlighted by color. The solid black lines in the MA and scatter plots are the 1-to-1 lines; the dotted lines are 2-fold change lines.
# make MA plots
MA_plots(results, "ave_P3", "ave_P6", "P3 vs P6 lenses")
# make scatter plots
scatter_plots(results, "ave_P3", "ave_P6", "P3 vs P6 lenses")
# make a volcano plot
volcano_plot(results, "ave_P3", "ave_P6", "P3 vs P6 lenses")
We will look at the top 10 by p-value in each direction. Some proteins will not be statistically significant.
# look at the top candidates
set_plot_dimensions(9, 3.5)
n_left <- 3
n_right <- 3
top_tags <- 10
color <- c("blue", "dark green")
plot_top_tags(results, n_left, n_right, top_tags, color)
set_plot_dimensions(9, 9)
Compare P6 to P9 lenses.
We will specify the pair of interest for the exact test in edgeR and use the experiment-wide dispersion. The topTags table show us the most significant candidates (top 10). We can also summarize how many up and down regulated candidates we have at an FDR of 0.10. We save the test results in tt. We use a handy MA (mean-average) plotting function from limma to visualize the DE candidates and check the p-value distribution.
# use a 0.05 cutoff
tt = pairwise_test(y, c("P6", "P9"), 0.05)
genes logFC logCPM PValue FDR
2883 sp|Q6UGQ3|SG2B2_MOUSE 5.1479 4.013 9.154e-24 3.889e-20
1041 sp|O09131|GSTO1_MOUSE 1.9355 6.570 5.428e-22 1.153e-18
1329 sp|P21460|CYTC_MOUSE 2.0749 6.110 5.159e-20 7.307e-17
1369 sp|Q9D3H2|OBP1A_MOUSE 4.3698 5.545 3.789e-09 4.024e-06
1536 tr|Q497J0|Q497J0_MOUSE 2.8936 6.037 3.968e-07 3.372e-04
1102 sp|P49194|RET3_MOUSE 1.9470 6.331 5.990e-07 4.055e-04
1810 tr|O35176|O35176_MOUSE 4.7616 4.924 6.680e-07 4.055e-04
1980 sp|Q3UW53|NIBA1_MOUSE -1.3288 4.616 1.788e-05 9.496e-03
452 sp|Q61555|FBN2_MOUSE -0.9121 8.098 4.816e-05 2.274e-02
1444 sp|Q06890|CLUS_MOUSE 0.6138 5.683 8.646e-05 3.674e-02
P9-P6
Down 2
NotSig 4239
Up 8
There are very few DE candidates and the distribution of unchanged proteins (black points) is much tighter along the 0-fold-change horizontal line. The p-value distribution is flat-ish. The mouse lens proteome seems to be stabilizing with age (maybe accumulating lens proteins make it harder to see changes).
The "up in P9 versus P6" proteins have an odd pattern. The large fold-changes at similar protein intensities look suspicious (many are secreted proteins - contaminant issue?)
We will add the statistical testing results (logFC, p-values, and FDR), condition intensity averages, and candidate status to the results data frame (which has the TMM-normalized data) and accumulate all comparisons into all_results.
We will make MA plots, scatter plots, and a volcano plot using ggplot2. We will also look at the distribution of log2 expression ratios separated by differential expression (DE) category. The Benjamini-Hochberg corrected edgeR p-values are used to categorize the DE candidates: no > 0.10 > low > 0.05 > med > 0.01 > high.
# get the results summary
results <- collect_results(tmt_tmm, tt$table, P6, "P6", P9, "P9")
# make column names unique by adding comparison (for the accumulated frame)
results_temp <- results
colnames(results_temp) <- str_c(colnames(results), "_P6_P9")
# accumulate testing results
all_results <- cbind(all_results, results_temp)
# count candidates and check log2 fold change distributions
log2_fold_changes(results, 3, 3, "P6 vs P9 lenses logFC distributions by candidate")
candidate n 1 high 8 2 med 2 3 low 2 4 no 4237
We have many comparisons to visualize, so we will use functions to generate a series of plots. We will make: an MA plot with candidates highlighted by color, faceted MA plots separated by candidate status, a scatter plot with candidates highlighted by color, faceted scatter plots separated by candidate status, and a volcano plot with candidates highlighted by color. The solid black lines in the MA and scatter plots are the 1-to-1 lines; the dotted lines are 2-fold change lines.
# make MA plots
MA_plots(results, "ave_P6", "ave_P9", "P6 vs P9 lenses")
# make scatter plots
scatter_plots(results, "ave_P6", "ave_P9", "P6 vs P9 lenses")
# make a volcano plot
volcano_plot(results, "ave_P6", "ave_P9", "P6 vs P9 lenses")
We will look at the top 10 by p-value in each direction. Some proteins will not be statistically significant.
# look at the top candidates
set_plot_dimensions(9, 3.5)
n_left <- 3
n_right <- 3
top_tags <- 10
color <- c("dark green", "green")
plot_top_tags(results, n_left, n_right, top_tags, color)
set_plot_dimensions(9, 9)
Contaminants typcally vary sample to sample. Most of these top candidates has conssitent abundance levels across the 3 samples at each age.
The age grouping that emerged from the cumulative abundance distribution curves (and was also apparent in the MDS clustering first dimension separations) seems to be confirmed in the pairwise statistical testing. We had many candidates between E15 and E18 ages. There were few candidates between E18 and P0 (suggesting E18 and P0 could be grouped). There were moderate numbers of candidates between P0 and P3, indicating some changes. There were few candidates between P3/P6 and P6/P9 suggesting that the 3 older ages are similar.
This lens time course is simpler than most. We have three major cell populations: epithelial cells (monolayer on the anterior face of the lens), differentiating cortex cells (the outer layers of the lens (elongating cells with nuclei)), and the mature fiber cells (bags of crystallins and other major lens proteins). There is little protein turn over in lens and it is well known that crystallins progressively dominate the proteome as the lens matures.
There will be some lens growth during the developmental ages. If we think about the lens as a sphere, epithelial cells grow like the surface area (R^2) and mature fiber cells grow like the volume (R^3). Major lens proteins should go up across the ages, epithelial cell proteins should go down. Proteins from the cortex probably go down, but none of these cell populations are homogeneous. Epithelial cells at the start of differentiation will be more typical cells. Mature fiber cells are mostly major lens proteins with regular cell machinery and structures degraded. The cortex cells span this spectrum.
all_results frame to TSV file¶We collected the normalized data and statistical testing results and we can save that table to make a nice spreadsheet summary.
write.table(all_results, "Khan-2018_results_successive-ages_edgeR-exact_TMM.txt", sep = "\t",
row.names = FALSE, na = " ")
sessionInfo()
R version 3.5.3 (2019-03-11) Platform: x86_64-apple-darwin15.6.0 (64-bit) Running under: macOS 10.16 Matrix products: default BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] psych_2.0.7 edgeR_3.24.3 limma_3.38.3 scales_1.1.1 [5] forcats_0.5.0 stringr_1.4.0 dplyr_1.0.2 purrr_0.3.4 [9] readr_1.3.1 tidyr_1.1.1 tibble_3.0.3 ggplot2_3.3.2 [13] tidyverse_1.3.0 loaded via a namespace (and not attached): [1] Rcpp_1.0.5 locfit_1.5-9.4 lubridate_1.7.9 lattice_0.20-41 [5] assertthat_0.2.1 digest_0.6.25 IRdisplay_0.7.0 R6_2.4.1 [9] cellranger_1.1.0 repr_1.1.0 backports_1.1.8 reprex_0.3.0 [13] evaluate_0.15 httr_1.4.2 pillar_1.4.6 rlang_1.0.2 [17] uuid_0.1-4 readxl_1.3.1 rstudioapi_0.11 blob_1.2.1 [21] labeling_0.3 splines_3.5.3 munsell_0.5.0 broom_0.7.0 [25] compiler_3.5.3 modelr_0.1.8 pkgconfig_2.0.3 base64enc_0.1-3 [29] mnormt_1.5-6 htmltools_0.4.0 tidyselect_1.1.0 crayon_1.3.4 [33] dbplyr_1.4.4 withr_2.2.0 grid_3.5.3 nlme_3.1-148 [37] jsonlite_1.7.0 gtable_0.3.0 lifecycle_0.2.0 DBI_1.1.0 [41] magrittr_1.5 cli_3.3.0 stringi_1.4.6 farver_2.0.3 [45] fs_1.5.0 xml2_1.3.2 ellipsis_0.3.2 generics_0.0.2 [49] vctrs_0.4.1 IRkernel_1.1.1 tools_3.5.3 glue_1.6.2 [53] hms_0.5.3 parallel_3.5.3 colorspace_1.4-1 rvest_0.3.6 [57] pbdZMQ_0.3-3 haven_2.3.1